IS 608: KNOWLEDGE & VISUAL ANALYTICS - PROJECT PROPOSAL | Data Analytics
The goal of this project is to:
The project emphasizes explanatory techniques, with the aim of presenting the data to viewers more succinctly. I therefore plan to use the R programming language to explore and analyze the dataset.
The dataset to be used is the World Health Nutrition and Population Statistics from year 2010 to 2016. It can be obtained from http://databank.worldbank.org/data/reports.aspx?source=health-nutrition-and-population-statistics#advancedDownloadOptions .
Load these libraries and the dataset, and let's get to work!
suppressMessages(library(knitr))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(sqldf))
suppressPackageStartupMessages(library(googleVis))
## Creating a generic function for 'toJSON' from package 'jsonlite' in package 'googleVis'
df <- read.csv("World_Health_2.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
kable(head(df[200:206, ]))

| | Series_Name | Series_Code | Country_Name | Country_Code | YR1960 | YR1970 | YR1980 | YR1990 | YR2000 | YR2010 | YR2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | Children (ages 0-14) newly infected with HIV | SH.HIV.INCD.14 | Saudi Arabia | SAU | .. | .. | .. | 100 | 100 | 100 | 100 |
| 201 | Children (ages 0-14) newly infected with HIV | SH.HIV.INCD.14 | Senegal | SEN | .. | .. | .. | 200 | 1000 | 1000 | 500 |
| 202 | Children (ages 0-14) newly infected with HIV | SH.HIV.INCD.14 | Serbia | SRB | .. | .. | .. | .. | .. | .. | .. |
| 203 | Children (ages 0-14) newly infected with HIV | SH.HIV.INCD.14 | Seychelles | SYC | .. | .. | .. | .. | .. | .. | .. |
| 204 | Children (ages 0-14) newly infected with HIV | SH.HIV.INCD.14 | Sierra Leone | SLE | .. | .. | .. | 200 | 500 | 1000 | 500 |
| 205 | Children (ages 0-14) newly infected with HIV | SH.HIV.INCD.14 | Singapore | SGP | .. | .. | .. | .. | .. | .. | .. |
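
The preview above shows missing values written as `..`, which forces the year columns to be read as character. A minimal sketch of converting such columns to numeric, using a small made-up data frame (the name `toy` and its values are illustrative assumptions, not taken from the dataset):

```r
# Toy data frame mimicking the export: ".." marks a missing value.
toy <- data.frame(
  Country_Name = c("Senegal", "Serbia"),
  YR2000 = c("1000", ".."),
  YR2010 = c("1000", ".."),
  stringsAsFactors = FALSE
)

year_cols <- c("YR2000", "YR2010")

# as.numeric() turns non-numeric strings such as ".." into NA and warns;
# the warning is silenced here because the NA is the intended result.
toy[year_cols] <- lapply(
  toy[year_cols],
  function(x) suppressWarnings(as.numeric(x))
)

str(toy)
```

Passing `na.strings = ".."` to `read.csv` should give a similar result at load time, since the markers would then be read as `NA` directly.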
lat_long <- read.csv("Countries_long_lat2.csv", header = TRUE, sep = ",")
colnames(lat_long) <- c("Country", "Country_Code", "Latitude", "Longtitude")
kable(head(lat_long))

| Country | Country_Code | Latitude | Longtitude |
|---|---|---|---|
| Albania | ALB | 41.0000 | 20.0000 |
| Algeria | DZA | 28.0000 | 3.0000 |
| American Samoa | ASM | -14.3333 | -170.0000 |
| Andorra | AND | 42.5000 | 1.6000 |
| Angola | AGO | -12.5000 | 18.5000 |
| Anguilla | AIA | 18.2500 | -63.1667 |
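
The coordinates table is joined to the health data on `Country_Code` in the next step. As a rough sketch of what `merge()` with `all = FALSE` does, the example below uses two tiny made-up tables (`health`, `coords`, and the code `XXX` are hypothetical) and then builds the `latitude:longitude` string that googleVis maps accept as a location:

```r
# Hypothetical miniature versions of the two tables.
health <- data.frame(
  Country_Code = c("ALB", "DZA", "XXX"),
  YR2010 = c(1, 2, 3),
  stringsAsFactors = FALSE
)
coords <- data.frame(
  Country_Code = c("ALB", "DZA", "AND"),
  Latitude = c(41, 28, 42.5),
  Longtitude = c(20, 3, 1.6),   # spelling kept to match the document's column name
  stringsAsFactors = FALSE
)

# all = FALSE is an inner join: only codes present in both tables survive.
joined <- merge(health, coords, by = "Country_Code", all = FALSE)

# Combine the two coordinates into a single "lat:long" string.
joined$Lat_Long <- paste(joined$Latitude, joined$Longtitude, sep = ":")

print(joined)
```

Rows without a match on either side (`XXX` and `AND` here) are dropped, so countries missing from the coordinates file would not appear on the map.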
Clean the dataset: join the coordinates onto the health data and convert the year columns to numeric.
options(warn = -1)
df2 <- merge(df, lat_long, by.x = "Country_Code", by.y = "Country_Code", all = FALSE)
df2[, 5:11] <- sapply(df2[, 5:11], as.numeric)
Merge the Latitude and Longitude columns into a single "latitude:longitude" string, the coordinate format used for the googleVis map.
df2$Lat_Long = paste(df2$Latitude, df2$Longtitude, sep=":")
We now use SQL (via sqldf) to query the columns needed to compare 2000 with 2010 for countries where the number of children orphaned by HIV/AIDS was at least 50,000 in both years.
sq <- sqldf("select Lat_Long, Country_Name, YR2000, YR2010 from df2
where YR2010 >= 50000 and YR2000 >= 50000 and Series_Name ='Children orphaned by HIV/AIDS'
order by YR2010, YR2000 limit 20")
head(sq)
## Lat_Long Country_Name YR2000 YR2010
## 1 -10:-76 Peru 61000 54000
## 2 -1:15 Congo, Rep. 52000 70000
## 3 23:-102 Mexico 66000 71000
## 4 17:-4 Mali 51000 91000
## 5 2.5:112.5 Malaysia 51000 96000
## 6 19:-72.4167 Haiti 110000 100000
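
Since dplyr is already loaded, the same subset could also be expressed without SQL. The sketch below is one possible equivalent, run on a made-up stand-in for `df2` (the data frame `toy_df2` and all of its values are illustrative assumptions, not real figures):

```r
library(dplyr)

# Toy stand-in for df2; rows are invented for illustration only.
toy_df2 <- data.frame(
  Lat_Long = c("1:1", "2:2", "3:3", "4:4"),
  Country_Name = c("A", "B", "C", "D"),
  Series_Name = c(rep("Children orphaned by HIV/AIDS", 3), "Other series"),
  YR2000 = c(60000, 40000, 90000, 70000),
  YR2010 = c(80000, 75000, 55000, 90000),
  stringsAsFactors = FALSE
)

sq_dplyr <- toy_df2 %>%
  filter(Series_Name == "Children orphaned by HIV/AIDS",
         YR2000 >= 50000, YR2010 >= 50000) %>%   # same conditions as the WHERE clause
  arrange(YR2010, YR2000) %>%                    # ascending, as in ORDER BY
  select(Lat_Long, Country_Name, YR2000, YR2010) %>%
  head(20)                                       # same role as LIMIT 20

print(sq_dplyr)
```

Like the SQL `WHERE` clause, `filter()` drops rows where a condition evaluates to `NA`, so countries with missing values in either year are excluded in both versions.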
Show_map <- gvisMap(sq, "Lat_Long", "Country_Name",
                    options = list(showTip = TRUE, mapType = 'normal',
                                   enableScrollWheel = TRUE,
                                   icons = paste0("{",
                                                  "'default': {'normal': 'http://icons.iconarchive.com/",
                                                  "icons/icons-land/vista-map-markers/48/",
                                                  "Map-Marker-Ball-Azure-icon.png',\n",
                                                  "'selected': 'http://icons.iconarchive.com/",
                                                  "icons/icons-land/vista-map-markers/48/",
                                                  "Map-Marker-Ball-Right-Azure-icon.png'",
                                                  "}}")))
plot(Show_map)
## starting httpd help server ... done
Note that this map was saved to a local file. For an interactive version, run the code above.
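g <- ggplot(sq, aes(YR2010, YR2000)) +
  # Each country contributes a single row, so draw points; a line needs at least two observations per group.
  geom_point(aes(colour = Country_Name)) +
  labs(title = "Top 20 Countries Where Children Orphaned by HIV/AIDS For YRS 2000 & 2010") +
  geom_smooth(se = TRUE)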
ggplotly(g)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'